In the competitive landscape of order processing, the promptness and
efficiency of acknowledging orders are pivotal for sustaining high
customer satisfaction and operational effectiveness. However, an
analysis of this sample dataset reveals a concerning trend: a
significant portion of order acknowledgments are not being made on time.
This inefficiency poses a risk not only to customer satisfaction but
also to the reliability of the order fulfillment process. The aim of
this analysis is to examine the underlying causes of these delays by
analyzing the days it takes to acknowledge orders and exploring
variations across different dimensions such as profile owner, location,
and leader. Through descriptive analysis and K-means clustering, we seek
to uncover patterns, bottlenecks, and actionable insights that can
ultimately lead to process optimizations. Identifying distinct clusters
of order behaviors and acknowledgment times will allow us to pinpoint
specific areas for improvement, thereby enhancing process efficiencies
and ensuring timely order acknowledgments. The ultimate goal is to
transform these insights into strategic actions that elevate operational
performance and customer service levels.
1. **Load the Data into R**
2. **Descriptive Analysis**: Conduct a thorough descriptive analysis to gain a foundational understanding of the dataset. This includes generating summary statistics, analyzing the distribution of days to acknowledge across various factors, and visualizing data to uncover initial insights and patterns.
3. **Determine the Optimal Number of Clusters**: Utilize the Elbow method to ascertain the optimal number of clusters for the dataset. This technique helps identify the point where increasing the number of clusters no longer significantly improves the model's fit, balancing simplicity against explanatory power.
4. **Perform K-means Clustering**: Apply K-means clustering to segment orders based on acknowledgment times and other relevant characteristics. This unsupervised learning approach categorizes orders into clusters with similar features, revealing inherent groupings within the data.
5. **Visualize the Clusters**: Visualize the resulting clusters to gain insights into the distinct groupings of orders. This step helps identify patterns, trends, and differences across clusters, providing a clear understanding of order acknowledgment behaviors and the factors contributing to late acknowledgments.
6. **Conclusion**
After loading these essential libraries, we can proceed to load and
initially inspect our dataset. The dataset, order_late, contains
information about order acknowledgments, including whether they were
made on time or not. The dataset also includes details about the profile
owner, leader, location, and other relevant attributes that can be used
to understand the patterns and factors contributing to late
acknowledgments. Let’s start by loading the data and taking a look at
the first few rows to understand its structure and contents.
library(tidyverse)
library(DT)
library(lubridate)
library(cluster)
library(factoextra)
library(shiny)
order_late %>%
  DT::datatable(
    extensions = 'Buttons',
    options = list(
      dom = 'Blfrtip',
      buttons = c('copy', 'csv', 'excel'),
      pageLength = 5,
      scrollX = TRUE
    )
  )
- profile_owner: The identifier of the individual who owns the profile related to the order.
- leader_name: The identifier of the leadership or supervisory figure associated with the order or the profile owner.
- loc: A code or number that represents the location where the order was processed or is to be fulfilled from.
- order: The unique identifier assigned to the order.
- customer: The name of the individual or entity to whom the order will be delivered.
- order_date: The date on which the order was placed or recorded.
- week_number: The week of the year when the order was placed, which could be useful for seasonal analysis.
- delivery_date: The date when the order is scheduled to be delivered to the customer.
- ship_date: The actual date when the order was shipped out from the facility.
- date_acknowledge: The date on which the order acknowledgment was recorded in the system.
- date_acknowledgement_calc: Calculated date for when the order was supposed to be acknowledged, possibly used for performance tracking.
- days_to_acknowledge: The number of days it took to acknowledge the order from the order date, a measure of processing time.
- on_time: An indicator of whether the order acknowledgment was within the expected time frame, with 'On Time' = 1 and 'Not on Time' = 0.
These columns together can provide valuable insights into the order
processing efficiency and timeliness. Understanding patterns and
relationships within these columns through clustering or other data
analysis methods could help in identifying bottlenecks, predicting
future performance, and improving overall service delivery.
Before diving into complex analytical techniques, it’s crucial to
start with a descriptive analysis of our dataset. This beginning step
will allow us to understand the basic characteristics of the data,
identify any immediate patterns, and set the stage for more in-depth
analysis.
### 2-1. Summary Statistics
order_late %>%
  dplyr::summarise(
    Mean = mean(days_to_acknowledge, na.rm = TRUE),
    Median = median(days_to_acknowledge, na.rm = TRUE),
    Min = min(days_to_acknowledge, na.rm = TRUE),
    Max = max(days_to_acknowledge, na.rm = TRUE),
    SD = sd(days_to_acknowledge, na.rm = TRUE)
  )
Mean: The average number of days to acknowledge an order is
approximately 51.66 days. This indicates the central tendency of our
dataset, suggesting that on average, orders take about 52 days to be
acknowledged.
Median: The median days to acknowledge is 52, which means half of
the orders are acknowledged in fewer than 52 days, and the other half
take longer.
Minimum (Min): The fastest acknowledgment time recorded is 2
days, indicating that some orders are acknowledged almost immediately
after being placed.
Maximum (Max): On the other end, the longest time taken to
acknowledge an order is 105 days, suggesting significant delays in some
cases.
Standard Deviation (SD): With a standard deviation of
approximately 31.99, there’s considerable variability in the
acknowledgment times. This high variability indicates that the
acknowledgment process’s efficiency varies widely across different
orders.
The considerable gap between the minimum and maximum values,
along with a high standard deviation, suggests that while some orders
are processed efficiently, others face substantial delays.
This histogram provides a graphical representation of the frequency distribution and is an essential tool for spotting trends and patterns that might not be evident from the summary statistics alone.
order_late %>%
  ggplot(aes(x = days_to_acknowledge)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Days to Acknowledge",
       x = "Days to Acknowledge",
       y = "Frequency") +
  theme_classic()
- The data appears to be right-skewed, indicating that while
most orders are acknowledged within a shorter period, there is a long
tail of orders that take much longer to be acknowledged.
- There
is a high frequency of orders that are acknowledged in just a few days
after being placed, as shown by the tall bars at the lower end of the
histogram.
- The presence of bars across the entire range up to
100 days illustrates variability in the acknowledgment times across
different orders.
Exploring the distribution of acknowledgment times across different profile owners can reveal individual or systemic factors influencing the efficiency of order processing. By breaking the histogram of days to acknowledge down by profile owner, I aim to uncover the following:
order_late %>%
  ggplot(aes(x = days_to_acknowledge)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Days to Acknowledge by Profile Owner",
       x = "Days to Acknowledge",
       y = "Frequency") +
  facet_wrap(~profile_owner) +
  theme_classic()
From the faceted histograms, we can note the following potential areas
of focus:
- Profile owners such as Andrew Bates and April Lynch
show a concentration of acknowledgments within the swift timeframe,
suggesting an efficient acknowledgment process.
- Other
profiles, for example, Christopher Marti and Dakota Young, display a
wider spread of acknowledgment times, indicating a more variable process
that could benefit from a review to understand the causes of
delays.
- It’s important to note that while a
right-skewed distribution is generally favorable in this context, any
extensive right tail or outliers can still highlight opportunities for
improvement.
We can target these specific areas with training,
process adjustments, or other interventions to streamline acknowledgment
times further. The goal is not only to maintain quick processing for
most orders but also to reduce the frequency and extent of any outliers,
ensuring a consistently high-performing acknowledgment process across
all profile owners.
Assessing the days to acknowledge by location, a right-skewed
distribution generally signifies prompt acknowledgment of orders—this
skewness indicates a location’s strong performance in quickly processing
most of its orders.
order_late %>%
  ggplot(aes(x = days_to_acknowledge)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Days to Acknowledge by Location",
       x = "Days to Acknowledge",
       y = "Frequency") +
  facet_wrap(~loc) +
  theme_classic()
- Location 5: The pronounced right skewness here is an indicator of exceptional performance, with the bulk of orders being acknowledged very swiftly and only a few exceptions taking longer.
- Location 28: Demonstrates similar right skewness to Location 5, suggesting that the location efficiently acknowledges most orders, with rare delays.
Across all locations, understanding the right skewness within the context of order acknowledgment times is valuable. It allows for the recognition of high-performing locations, providing a benchmark for others, and highlights the necessity to address the exceptional cases in the tail to achieve consistent, organization-wide operational excellence.
The distribution of days to acknowledge orders when segmented by
leaders can offer insights into management effectiveness and team
performance. Analyzing these distributions helps identify which leaders
are overseeing processes that ensure orders are acknowledged promptly
and which may need to address delays within their teams.
order_late %>%
  ggplot(aes(x = days_to_acknowledge)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Days to Acknowledge by Leader",
       x = "Days to Acknowledge",
       y = "Frequency") +
  facet_wrap(~leader_name) +
  theme_classic()
Analyzing the distribution of days to acknowledge by week number can
provide insights into the operational trends over time and the influence
of seasonal factors.
order_late %>%
  ggplot(aes(x = days_to_acknowledge)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Days to Acknowledge by Week Number",
       x = "Days to Acknowledge",
       y = "Frequency") +
  facet_wrap(~week_number) +
  theme_classic()
The histograms segmented by week number reveal:
- Some weeks show quick acknowledgments. For example, weeks 1, 2, and 3 are strongly right-skewed, indicative of a strong start to the year.
- Other weeks, such as week 40, show the opposite pattern, with most orders acknowledged very late.
This temporal analysis can highlight periods where the process may
require adjustment due to increased order volumes or staff availability
issues, such as holiday seasons or financial year-ends. Understanding
these patterns is crucial for planning resources, managing workload, and
setting realistic timelines for order processing.
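These week-to-week shifts can also be made explicit numerically by summarising the mean acknowledgment time per week, rather than reading it off the faceted histograms. A minimal sketch on toy data (hypothetical values, with column names matching the data dictionary above):

```r
library(dplyr)

# Toy rows standing in for order_late (made-up values for illustration)
toy <- tibble::tibble(
  week_number         = c(1, 1, 2, 40, 40),
  days_to_acknowledge = c(3, 5, 4, 80, 90)
)

# Mean acknowledgment time per week -- one row per week_number
weekly <- toy %>%
  group_by(week_number) %>%
  summarise(avg_days = mean(days_to_acknowledge), .groups = "drop")

weekly
# week 1 averages 4 days; week 40 averages 85 days
```

Applied to the full dataset, plotting `avg_days` against `week_number` gives a compact view of the seasonal pattern.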
This analysis provides a statistical breakdown of order acknowledgment times, distinguishing between orders acknowledged within the expected timeframe (on time) and those that were not (not on time).
order_late %>%
  group_by(on_time) %>%
  summarise(
    Mean_days_to_acknowledge = mean(days_to_acknowledge, na.rm = TRUE),
    Median_days_to_acknowledge = median(days_to_acknowledge, na.rm = TRUE),
    SD_days_to_acknowledge = sd(days_to_acknowledge, na.rm = TRUE),
    Min_days_to_acknowledge = min(days_to_acknowledge, na.rm = TRUE),
    Max_days_to_acknowledge = max(days_to_acknowledge, na.rm = TRUE),
    Count = n()
  )
order_late %>%
  select(days_to_acknowledge) %>%
  scale() %>%
  head(10)
## days_to_acknowledge
## [1,] -1.020829
## [2,] -1.114590
## [3,] -1.114590
## [4,] -1.114590
## [5,] -1.114590
## [6,] -1.114590
## [7,] -1.114590
## [8,] -1.083336
## [9,] -1.489634
## [10,] -1.489634
I used the scale() function to normalize the distribution of values in the dataset. It subtracts the mean and divides by the standard deviation for each value. This process, called z-score normalization, transforms the data into a distribution with a mean of 0 and a standard deviation of 1.
Negative Values: The negative values indicate that these data
points are below the mean of the days_to_acknowledge distribution in
the dataset.
Magnitude of the Values: The magnitude of the negative numbers
reflects how many standard deviations away from the mean each value is.
For instance, a value of -1.02 means the corresponding
days_to_acknowledge is 1.02 standard deviations below the
mean.
Similar Values: The repeating value of -1.114590 suggests that
several orders have the same days_to_acknowledge value, which is also
1.114590 standard deviations below the mean.
Z-score: Each number is a z-score, representing how many standard
deviations a value is from the mean. A z-score close to 0 would mean the
value is near the average, while a higher magnitude (positive or
negative) means it is further from the average.
Normalization is an important step before clustering because it
puts all variables on the same scale, so one feature doesn't dominate
the others due to a difference in units or spread of values. When
running clustering algorithms like K-means, which rely on distance
measures, having features on the same scale means that each feature
contributes equally to the distance calculations.
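As a sanity check on what scale() is doing, a hand-rolled z-score (subtract the mean, divide by the standard deviation) should reproduce its output. A small sketch on toy values:

```r
# z-score normalization by hand: (x - mean) / sd
z_score <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

x <- c(2, 16, 19, 52, 105)  # toy days_to_acknowledge values

# scale() returns a matrix; drop its dimensions to compare with the vector version
all.equal(z_score(x), as.numeric(scale(x)))
# TRUE
```

The normalized values then have mean 0 and standard deviation 1, which is exactly what distance-based methods like K-means want.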
# Select numeric columns and remove rows with NA values
numeric_data <- dplyr::select_if(order_late, is.numeric) %>%
  na.omit()

# loc and order are identifiers encoded as numbers, not true numeric features, so drop them
numeric_data <- numeric_data %>%
  dplyr::select(-loc, -order)
# Confirming the structure
str(numeric_data)
## tibble [21,937 × 3] (S3: tbl_df/tbl/data.frame)
## $ week_number : num [1:21937] 41 42 42 42 42 42 42 42 1 1 ...
## $ days_to_acknowledge: num [1:21937] 19 16 16 16 16 16 16 17 4 4 ...
## $ on_time : num [1:21937] 0 0 0 0 0 0 0 0 0 0 ...
# Calculate WCSS for each number of clusters
wcss <- purrr::map_dbl(1:10, function(k) {
  kmeans(numeric_data, centers = k, nstart = 25)$tot.withinss
})

# Create a tibble and plot the Elbow curve
tibble::tibble(k = 1:10, wcss = wcss) %>%
  ggplot2::ggplot(ggplot2::aes(x = k, y = wcss)) +
  ggplot2::geom_line() +
  ggplot2::geom_point() +
  ggplot2::labs(
    title = "Elbow Method to Determine Optimal Number of Clusters",
    x = "Number of Clusters",
    y = "Within-Cluster Sum of Squares (WCSS)"
  ) +
  ggplot2::theme_classic()
The plot displayed here is a typical “Elbow plot,” which is
used to help determine the optimal number of clusters for K-means
clustering by looking at the within-cluster sum of squares (WCSS)
against the number of clusters.
In the Elbow method, we look for the point where the WCSS graph
starts to flatten out after a steep decline. The “elbow point”
represents the number of clusters after which adding more clusters
doesn’t significantly reduce the variance within the
clusters.
Based on the plot:
The steep decline from 1 to around 2 or 3 suggests significant
gains in reducing within-cluster variance by increasing the number of
clusters from 1 to 2 or 3.
After around 2 or 3 clusters, the line begins to flatten,
indicating that adding more clusters beyond this point does not yield as
substantial a decrease in WCSS.
The exact “elbow” isn’t always clear-cut, but in this plot, it
appears to be around the 2 or 3 cluster mark. This suggests that 2 or 3
might be the optimal number of clusters for this dataset. Choosing
beyond 3 clusters would likely not result in significantly better
clustering, as per the Elbow method.
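The judgement above can be made a little less subjective by looking at the marginal WCSS drop from each additional cluster and flagging where it collapses. A rough heuristic sketch (the wcss values here are made up to mimic the shape of the curve, not taken from the dataset, and the 10% threshold is an arbitrary choice):

```r
# Hypothetical WCSS values for k = 1..10, shaped like the Elbow plot above
wcss <- c(100, 50, 30, 27, 25, 24, 23.5, 23, 22.8, 22.7)

# gains[k] is the WCSS drop when going from k to k+1 clusters
gains <- -diff(wcss)

# Flag the first k whose marginal gain falls below 10% of the initial drop;
# the elbow is that k (adding the (k+1)-th cluster is no longer worth it)
elbow_k <- which(gains < 0.10 * gains[1])[1]
elbow_k
# 3
```

On a real curve this heuristic is only a tie-breaker; inspecting the plot remains the primary check.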
Next, I perform K-means clustering with the optimal number of clusters
identified in the previous step, using the kmeans() function on the
numeric data with the number of clusters set to 3. I also set the
nstart parameter to 25 so that the algorithm runs K-means 25 times
with different initial centroids and keeps the solution that minimizes
the total within-cluster variance.
# Set seed for reproducibility
set.seed(123)

# Perform K-means clustering
kmeans_results <- kmeans(numeric_data, centers = 3, nstart = 25)

# Add the cluster assignments to the data
order_late$cluster <- kmeans_results$cluster

order_late %>%
  data.frame()

# Summarize the cluster sizes
table(order_late$cluster) %>%
  data.frame()

# Inspect the cluster centers
kmeans_results$centers %>%
  data.frame()
After preprocessing the data to remove non-numeric and
irrelevant columns (loc and order), and performing K-means clustering,
we have the following results:
Cluster 1:
- Average Week Number: 49.47
- Average Days to Acknowledge: 55.80
- Percentage On Time: 0.00%

This cluster tends to represent orders acknowledged later in the year (weeks in the late 40s), with a longer acknowledgment time on average, and none of these orders were acknowledged on time.

Cluster 2:
- Average Week Number: 11.34
- Average Days to Acknowledge: 17.87
- Percentage On Time: 13.28%

This cluster includes orders from earlier in the year (weeks numbered around 11) with significantly shorter acknowledgment times, and a small proportion of these orders were acknowledged on time.

Cluster 3:
- Average Week Number: 42.80
- Average Days to Acknowledge: 89.36
- Percentage On Time: 0.00%

This cluster is associated with orders placed later in the year (week numbers in the low 40s) that have the longest acknowledgment times. As in Cluster 1, no orders in this cluster were acknowledged on time.
Clusters are differentiated primarily by days_to_acknowledge and week_number:
- Clusters 1 and 3 have a high average days_to_acknowledge, and none of their orders are on time.
- Cluster 2 represents more efficient order processing, with shorter days_to_acknowledge, and includes some orders that are on time.
- The week_number variable suggests potential seasonal influences on order acknowledgment times, with different times of the year showing distinct patterns.
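The per-cluster figures above (average week, average days, percentage on time) come from summarising the data by the cluster label. A minimal sketch of that summary on toy rows (made-up values, not the real dataset):

```r
library(dplyr)

# Toy stand-in for order_late after the cluster column has been attached
toy <- tibble::tibble(
  cluster             = c(1, 1, 2, 2, 2, 3),
  week_number         = c(48, 50, 10, 12, 11, 42),
  days_to_acknowledge = c(60, 50, 10, 20, 15, 90),
  on_time             = c(0, 0, 1, 0, 1, 0)
)

cluster_profile <- toy %>%
  group_by(cluster) %>%
  summarise(
    avg_week    = mean(week_number),
    avg_days    = mean(days_to_acknowledge),
    pct_on_time = 100 * mean(on_time),  # on_time is 0/1, so its mean is the on-time share
    n           = n(),
    .groups = "drop"
  )

cluster_profile
```

Running the same summary on order_late reproduces the cluster profiles discussed above.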
ggplot(order_late, aes(x = week_number, y = days_to_acknowledge, color = as.factor(cluster))) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(title = "Days to Acknowledge vs. Week Number by Cluster",
       x = "Week Number",
       y = "Days to Acknowledge",
       color = "Cluster") +
  theme_classic()
Cluster 1: Shows a trend of increasing acknowledgment time up to
a certain point in the year before declining, which might reflect
operational changes or varying demand.
Cluster 2: Acknowledgment times are consistently quick throughout
the year, indicating efficient processing.
Cluster 3: Acknowledgment times peak around the later weeks of
the year, suggesting a seasonal effect or bottleneck.
The trends suggest that different strategies may be needed
throughout the year to address acknowledgment delays, with particular
attention to the periods indicated by the peaks for Clusters 1 and
3.
order_late %>%
  plotly::plot_ly(
    x = ~week_number, y = ~days_to_acknowledge, z = ~on_time,
    color = ~factor(cluster), type = "scatter3d", mode = "markers"
  ) %>%
  plotly::layout(
    title = "K-means Clustering: Orders by Week Number, Days to Acknowledge, and On Time",
    scene = list(
      xaxis = list(title = "Week Number"),
      yaxis = list(title = "Days to Acknowledge"),
      zaxis = list(title = "On Time")
    )
  )
- Cluster 2 exhibits the strongest performance, with orders being on time despite some variation in the number of days to acknowledge. The cluster's proximity to '1' on the 'On Time' axis suggests that the processes and strategies in place for these orders are effective.
- Cluster 1 shows orders that are not on time, with varying acknowledgment times. This cluster requires attention to identify the bottlenecks that lead to delays.
- Cluster 3 also indicates orders not on time, with a broad range of acknowledgment times. Like Cluster 1, it signals an area needing significant improvement to meet on-time delivery goals.
For Cluster 2, the objective is to maintain high performance,
perhaps by refining the current strategies and exploring further
efficiencies.
For Cluster 1 and Cluster 3, it’s crucial to conduct a detailed
analysis to understand the causes behind the delays in acknowledgment.
Investigating whether these delays are due to operational, systemic, or
seasonal factors will help in formulating appropriate corrective
measures. Interventions may include process re-engineering, increased
staffing during peak periods, or implementing more effective order
tracking systems.
By learning from the efficiency of Cluster 2 and understanding the underlying issues in Clusters 1 and 3, actionable steps can be implemented to enhance the overall timeliness of order acknowledgments.
ggplot(order_late, aes(x = days_to_acknowledge, fill = as.factor(cluster))) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~cluster) +
  labs(title = "Histogram of Days to Acknowledge by Cluster",
       x = "Days to Acknowledge",
       y = "Count",
       fill = "Cluster") +
  theme_classic()
ggplot(order_late, aes(x = factor(cluster), y = days_to_acknowledge, fill = factor(cluster))) +
  geom_boxplot() +
  labs(title = "Boxplot of Days to Acknowledge by Cluster",
       x = "Cluster",
       y = "Days to Acknowledge",
       fill = "Cluster") +
  theme_classic()
- Cluster 2 (Green): This cluster has the majority of its orders with the lowest days to acknowledge, with a significant count close to zero. This suggests highly efficient processing and a quick response to orders. To maintain this performance, processes from this cluster should be studied and potentially used as a benchmark for other clusters.
- Cluster 1 (Red): The second-best cluster has a wider spread of acknowledgment times but also includes many orders with lower acknowledgment times. However, there's a noticeable tail extending towards higher acknowledgment days. Actions for this cluster could include analyzing why some orders take longer to acknowledge and applying corrective measures to shift the distribution further to the left.
- Cluster 3 (Blue): The worst-performing cluster shows acknowledgment times that are spread across a broad range with a higher concentration towards longer days. This indicates a need for significant improvement. The focus for this cluster should be on identifying and addressing systemic inefficiencies or operational bottlenecks that lead to delays.
ggplot(order_late, aes(x = reorder(leader_name, days_to_acknowledge), y = days_to_acknowledge, fill = as.factor(cluster))) +
  geom_boxplot() +
  theme_classic() +
  labs(title = "Days to Acknowledge by Leader and Cluster",
       x = "Leader Name",
       y = "Days to Acknowledge",
       fill = "Cluster") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(order_late, aes(x = reorder(loc, days_to_acknowledge), y = days_to_acknowledge, fill = as.factor(cluster))) +
  geom_boxplot() +
  theme_classic() +
  labs(title = "Days to Acknowledge by Location and Cluster",
       x = "Location",
       y = "Days to Acknowledge",
       fill = "Cluster") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Click the following link for reactive clustering filter data: Reactive Clustering filter Data
Cluster Performance:
- Cluster 2, characterized by orders that
are quickly acknowledged, represents the best performance, serving as a
model for efficient order processing.
- Cluster 1 exhibits a
moderate acknowledgment timeframe, indicating good but improvable
processes.
- Cluster 3 has the most delayed acknowledgments,
identifying it as the primary area needing intervention.
Seasonal and Temporal Effects:
- The acknowledgment times for
certain clusters peak at specific periods of the year, suggesting
seasonal impacts on order processing that require adaptive management
strategies.
Leadership Impact:
- There are clear variations in acknowledgment
times across different leaders. Emulating the practices of leaders who
oversee the most efficient clusters could improve acknowledgment
performance across teams.
Location-Specific Insights:
- Some locations are consistently
associated with delayed acknowledgments. Targeted improvements at these
locations could significantly enhance the timeliness of the overall
acknowledgment process.
Actionable Strategies:
- For clusters with delayed
acknowledgments, conduct root cause analyses to address systemic
issues.
- Implement best practices from efficient clusters
across the board to elevate performance.
- Consider seasonal
staffing adjustments and process optimizations to manage peak times
effectively.
- Engage in continuous improvement cycles, using
insights from the data to refine processes regularly.
Monitoring and Continuous Improvement:
- Establish a monitoring
system to ensure that the implemented changes yield the expected
improvements and to quickly identify any backsliding or new
issues.